Plots for visualizing temperature

The parquet file holding the dataframe produced by the exploratory data analysis is loaded here to build plots and visualizations.

In [1]:
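from pyspark.sql import SparkSession
# Note: 'local' runs Spark with a single worker thread (use 'local[2]' for two);
# spark.cores.max applies to standalone and Mesos cluster deployments.
spark = SparkSession.builder \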
    .master('local') \
    .appName('GraficasApp') \
    .config('spark.executor.memory', '2gb') \
    .config("spark.cores.max", "2") \
    .getOrCreate()

sc = spark.sparkContext
In [2]:
# loading the dataset; SparkSession.read supersedes the older SQLContext entry point
ndf = spark.read.parquet('hdfs:///datasets/ndf.parquet')
ndf.show(10)
+---+-------------------+------------------+-----------------------------+-----------+--------+--------+---------+---------+
| id|                 dt|AverageTemperature|AverageTemperatureUncertainty|       City| Country|Latitude|Longitude|Elevation|
+---+-------------------+------------------+-----------------------------+-----------+--------+--------+---------+---------+
|  1|1825-01-01 00:00:00|25.331999999999997|                        3.194|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|  2|1825-02-01 00:00:00|25.549000000000003|           1.4709999999999999|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|  3|1825-03-01 00:00:00|            26.285|                        2.193|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|  4|1825-04-01 00:00:00|            26.999|                        2.571|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|  5|1825-05-01 00:00:00|27.450000000000006|                        1.591|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|  6|1825-06-01 00:00:00|27.732000000000006|           0.9009999999999999|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|  7|1825-07-01 00:00:00|            27.765|                         3.14|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|  8|1825-08-01 00:00:00|            26.399|                        2.432|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|  9|1825-09-01 00:00:00|            26.143|                        1.976|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
| 10|1825-10-01 00:00:00|26.040000000000006|                        2.577|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
+---+-------------------+------------------+-----------------------------+-----------+--------+--------+---------+---------+
only showing top 10 rows

In [3]:
ndf.count()
Out[3]:
8235082
In [4]:
ndf = ndf.drop("id")

This dataframe is too large to convert directly into a pandas dataframe for plotting, so summarized data is computed for each type of plot.
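
The idea of summarizing before converting can be sketched in plain Python, independent of Spark: a grouped average only needs a running sum and count per key, so the result has one entry per distinct key rather than one per reading. The function name `average_by_key` and the sample readings are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict


def average_by_key(rows):
    """Return {key: mean(value)} using one running (sum, count) pair per key."""
    totals = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        if value is None:
            # Skip missing readings, in the same spirit as an average that ignores nulls.
            continue
        totals[key][0] += value
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical (year, temperature) readings, not taken from the dataset.
readings = [("1825", 25.0), ("1825", 27.0), ("1826", 26.0), ("1826", None)]
summary = average_by_key(readings)
print(summary)
```

Four readings collapse to two entries here; in the notebook the same kind of reduction turns millions of rows into a few hundred before `toPandas()` is called.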

In [5]:
# converting the date
In [6]:
# grouping by year
from pyspark.sql.functions import udf
import pandas as pd
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType

def getYear(fecha):
    año = fecha.split("-")[0]
    return año

getYear_udf = udf(getYear, StringType())
In [7]:
ndfa = ndf.withColumn('dt', getYear_udf(ndf["dt"]))
In [8]:
ndfa = ndfa.orderBy("dt")
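
A Python UDF receives `None` for null cells, and calling `.split` on `None` raises an error. As a minimal sketch, assuming `dt` arrives either as a 'YYYY-MM-DD HH:MM:SS' string or as a `datetime`, a null-tolerant variant of the year helper could look like the following. The name `get_year` is hypothetical. Spark's built-in `pyspark.sql.functions.year` may be an alternative that avoids a Python UDF altogether.

```python
from datetime import datetime


def get_year(fecha):
    """Return the year prefix of a 'YYYY-MM-DD ...' value as a string, or None for nulls."""
    if fecha is None:
        return None
    # str() turns a datetime into 'YYYY-MM-DD HH:MM:SS', so both input types are handled.
    return str(fecha).split("-")[0]


print(get_year("1825-01-01 00:00:00"))
print(get_year(datetime(1825, 1, 1)))
print(get_year(None))
```

Wrapping this function with `udf(get_year, StringType())` would keep the rest of the cells unchanged.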

Average temperature by year

In [9]:
# grouping by year
pndf_años = ndfa.groupBy('dt').avg("AverageTemperature", "AverageTemperatureUncertainty").toPandas()
In [10]:
pndf_años = pndf_años.sort_values(by="dt")
pndf_años = pndf_años.reset_index()
print(pndf_años.columns)
pndf_años.head(5)
Index(['index', 'dt', 'avg(AverageTemperature)',
       'avg(AverageTemperatureUncertainty)'],
      dtype='object')
Out[10]:
   index    dt  avg(AverageTemperature)  avg(AverageTemperatureUncertainty)
0    227  1743                 4.882424                            1.953902
1    175  1744                10.734047                            1.844291
2    208  1745                 1.497593                            1.844013
3    107  1750                 9.872808                            1.792405
4    226  1751                10.046739                            1.733012
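
The `index` column in the output above is the pre-sort row position that `reset_index()` keeps as a regular column. If it is not needed, `reset_index(drop=True)` discards it. A small sketch with made-up values (the frame `yearly` is hypothetical):

```python
import pandas as pd

# Hypothetical yearly averages, deliberately out of order.
yearly = pd.DataFrame({
    "dt": ["1750", "1743", "1744"],
    "avg(AverageTemperature)": [9.9, 4.9, 10.7],
})

# Sort by year and renumber the rows without keeping the old positions.
tidy = yearly.sort_values(by="dt").reset_index(drop=True)
print(list(tidy.columns))
print(list(tidy["dt"]))
```

Because the years are four-digit strings, sorting them as text gives the same order as sorting them as numbers.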
In [11]:
import plotly
import plotly.graph_objs as go
import pandas as pd
import numpy as np
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)         # initiate notebook for offline plot
In [12]:
pndf_años['dt'].count()
Out[12]:
267
In [13]:
data = [go.Scatter(
            x=pndf_años["dt"],
            y=pndf_años['avg(AverageTemperature)'],
            name="temperature (°C)"),
        go.Scatter(
            x=pndf_años["dt"],
            y=pndf_años['avg(AverageTemperatureUncertainty)'],
            name="uncertainty")
       ]

plotly.offline.iplot({
    "data": data,
    "layout": go.Layout(title="Global average temperature by year (°C)")
})
In [14]:
ndf.count()
Out[14]:
8235082

Average temperature by year in countries with extreme climates

In [15]:
pndf_paises = ndfa.groupBy('Country', 'dt').avg("AverageTemperature", "AverageTemperatureUncertainty").toPandas()
In [16]:
print(len(pndf_paises))
pndf_paises.head(10)
31556
Out[16]:
    Country    dt  avg(AverageTemperature)  avg(AverageTemperatureUncertainty)
0   Morocco  1756                17.425792                            3.225000
1   Germany  1756                 8.638011                            5.078406
2  Slovakia  1756                 9.146333                            4.272292
3    Russia  1759                 3.626598                            4.602569
4   Morocco  1768                16.275940                            4.123964
5    Russia  1770                 4.133762                            4.113828
6  Bulgaria  1780                10.811905                            3.196190
7    Canada  1792                 4.728526                            2.517953
8    Poland  1795                 7.512184                            4.488971
9   Hungary  1796                10.006426                            3.222639
In [17]:
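# identifying the countries with extreme temperatures
# (only the numeric columns are averaged; 'dt' holds year strings)
temp_cols = ["avg(AverageTemperature)", "avg(AverageTemperatureUncertainty)"]
pndf_countries = pndf_paises.groupby("Country")[temp_cols].mean()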
mas_frios = pndf_countries.sort_values(by="avg(AverageTemperature)").head(10).index
mas_calientes = pndf_countries.sort_values(by="avg(AverageTemperature)").tail(10).index
print(mas_frios)
print(mas_calientes)
pndf_countries
Index(['Mongolia', 'Iceland', 'Russia', 'Norway', 'Finland', 'Kazakhstan',
       'Estonia', 'Canada', 'Latvia', 'Sweden'],
      dtype='object', name='Country')
Index(['Cambodia', 'Benin', 'Mauritania', 'Guinea Bissau', 'Chad', 'Mali',
       'Burkina Faso', 'Sudan', 'Niger', 'Djibouti'],
      dtype='object', name='Country')
Out[17]:
                          avg(AverageTemperature)  avg(AverageTemperatureUncertainty)
Country
Afghanistan                             13.776048                            0.983623
Albania                                 15.498107                            1.497499
Algeria                                 17.764935                            1.489659
Angola                                  21.772917                            0.881423
Argentina                               17.229089                            0.831176
Armenia                                  8.365495                            1.168944
Australia                               16.665405                            0.634328
Austria                                  6.252623                            1.616855
Azerbaijan                              11.099677                            1.036308
Bahamas                                 24.754976                            1.131211
Bahrain                                 25.842994                            0.897620
Bangladesh                              25.035277                            0.889093
Belarus                                  6.099282                            1.531177
Belgium                                  9.700277                            1.622163
Benin                                   26.972106                            0.647919
Bolivia                                 11.349004                            0.910471
Bosnia And Herzegovina                  10.417101                            1.568765
Botswana                                18.992366                            0.715813
Brazil                                  22.144744                            0.927640
Bulgaria                                10.544790                            1.517045
Burkina Faso                            27.806737                            0.773630
Burma                                   26.008083                            0.931460
Burundi                                 20.803525                            0.728644
Cambodia                                26.912157                            0.652416
Cameroon                                24.625604                            0.723672
Canada                                   4.849407                            1.384825
Central African Republic                24.944802                            0.744589
Chad                                    27.192427                            0.860087
Chile                                   11.766404                            0.790397
China                                   12.091490                            0.984341
...                                           ...                                 ...
South Africa                            16.358768                            0.727290
South Korea                             10.684621                            0.734311
Spain                                   14.406040                            1.467905
Sri Lanka                               26.723132                            0.860885
Sudan                                   28.029673                            0.883212
Suriname                                26.423671                            0.667203
Swaziland                               21.202131                            0.736511
Sweden                                   5.633613                            1.698500
Switzerland                              7.510930                            1.628820
Syria                                   18.150843                            1.059578
Taiwan                                  21.684443                            0.678906
Tajikistan                               8.815343                            0.969743
Tanzania                                22.588050                            0.608919
Thailand                                26.678252                            0.762719
Togo                                    26.640961                            0.627027
Tunisia                                 18.747460                            1.451424
Turkey                                  12.896425                            1.452088
Turkmenistan                            14.202935                            1.065592
Uganda                                  24.017821                            0.678057
Ukraine                                  7.777368                            1.476684
United Arab Emirates                    26.569680                            0.842811
United Kingdom                           9.077754                            1.592413
United States                           13.559916                            1.431763
Uruguay                                 17.421169                            0.771975
Uzbekistan                              11.754961                            1.161992
Venezuela                               25.482626                            0.714302
Vietnam                                 24.819564                            0.703170
Yemen                                   25.769223                            0.964631
Zambia                                  21.049689                            0.810531
Zimbabwe                                19.773964                            0.709890

159 rows × 2 columns
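
pandas also offers `nsmallest` and `nlargest`, which pick the extremes directly instead of sorting and slicing with `head` and `tail`. The sketch below uses five country means copied from the table above; the frame name `countries` is an assumption made for illustration.

```python
import pandas as pd

# A handful of country means copied from the table above.
countries = pd.DataFrame(
    {"avg(AverageTemperature)": [4.849407, 28.029673, 5.633613, 27.192427, 14.406040]},
    index=pd.Index(["Canada", "Sudan", "Sweden", "Chad", "Spain"], name="Country"),
)

# Two coldest and two hottest countries by average temperature.
coldest = countries.nsmallest(2, "avg(AverageTemperature)").index
hottest = countries.nlargest(2, "avg(AverageTemperature)").index
print(list(coldest))
print(list(hottest))
```

Note that `nlargest` lists the hottest country first, whereas `tail(10)` on an ascending sort lists it last.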

In [18]:
# getting dataframes to plot separately
def graficarPorPais(paises, pndfs, title="Average temperature by year (°C)"):
    data = []
    for pais in paises:
        ordenado = pndfs[pndfs["Country"] == pais].sort_values(by="dt")
        data.append(go.Scatter(
            x=ordenado["dt"],
            y=ordenado["avg(AverageTemperature)"],
            name=pais
        ))
    plotly.offline.iplot({
        "data": data,
        "layout": go.Layout(title=title)
    })
In [19]:
graficarPorPais(mas_frios, pndf_paises, "Temperature (°C) of the 10 coldest countries")
graficarPorPais(mas_calientes, pndf_paises, "Temperature (°C) of the 10 warmest countries")

Distribution of average temperature (°C)

In [20]:
#docker exec -i mycluster-master jupyter notebook --ip=0.0.0.0 --port=8889 --allow-root
data = [go.Histogram(x=pndf_paises.sample(len(pndf_paises)//5)["avg(AverageTemperature)"])]

plotly.offline.iplot({
    "data": data,
    "layout": go.Layout(title="Distribution of average temperature (°C) by country and year")
})
print("Total samples:", len(pndf_paises["avg(AverageTemperature)"]))
Total samples: 31556
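
`DataFrame.sample` draws a different subset on every run unless a seed is supplied, so the histogram can change slightly between executions. A minimal sketch of a reproducible draw, assuming a seed of 42 (an arbitrary choice) and a hypothetical ten-row stand-in for `pndf_paises`:

```python
import pandas as pd

# Hypothetical stand-in for pndf_paises with ten rows.
frame = pd.DataFrame({"avg(AverageTemperature)": [float(i) for i in range(10)]})

n = len(frame) // 5  # same one-fifth sample size as in the histogram cell
first = frame.sample(n, random_state=42)
second = frame.sample(n, random_state=42)

# Both draws pick the same rows because they share the seed.
print(len(first), first.equals(second))
```

The printed total in the cell above counts all rows of `pndf_paises`, while the histogram itself is built from one fifth of them.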
In [21]:
ndf.show(5)
+-------------------+------------------+-----------------------------+-----------+--------+--------+---------+---------+
|                 dt|AverageTemperature|AverageTemperatureUncertainty|       City| Country|Latitude|Longitude|Elevation|
+-------------------+------------------+-----------------------------+-----------+--------+--------+---------+---------+
|1825-01-01 00:00:00|25.331999999999997|                        3.194|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|1825-02-01 00:00:00|25.549000000000003|           1.4709999999999999|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|1825-03-01 00:00:00|            26.285|                        2.193|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|1825-04-01 00:00:00|            26.999|                        2.571|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
|1825-05-01 00:00:00|27.450000000000006|                        1.591|Johor Bahru|Malaysia|   0.80N|  103.66E|        0|
+-------------------+------------------+-----------------------------+-----------+--------+--------+---------+---------+
only showing top 5 rows

Average temperature in cities and elevation above sea level

In [22]:
pndt = ndfa.groupBy("Elevation").avg("AverageTemperature").toPandas()
print(len(pndt))
pndt.head(10)
692
Out[22]:
   Elevation  avg(AverageTemperature)
0        148                 3.494045
1       2122                14.722362
2       1342                24.001516
3       1460                11.965502
4       1127                 9.330125
5        737                21.663342
6        858                 7.149741
7        540                14.550837
8       1143                10.221040
9        516                 8.624268
In [23]:
pndt = pndt.sort_values(by="Elevation")
pndt.head(10)
Out[23]:
     Elevation  avg(AverageTemperature)
314        -29                10.496743
92         -27                 9.936538
620        -26                 8.068412
652         -3                 8.327234
45          -1                 9.756586
669          0                20.578962
139          1                18.423071
615          2                21.084107
164          3                22.129599
335          4                25.117423
In [24]:
#pndt = pndt.sort_values(by="Elevation", ignore_index=True)
pndt.columns
Out[24]:
Index(['Elevation', 'avg(AverageTemperature)'], dtype='object')
In [33]:
data = [go.Scatter(
            x=pndt["Elevation"],
            y=pndt['avg(AverageTemperature)'],
            name="temperature (°C)", mode="markers"),
       ]

plotly.offline.iplot({
    "data": data,
    "layout": go.Layout(title="Temperature (°C) versus elevation (m above sea level)")
})
In [26]:
# NOTE: averaging elevations is not correct
pndf_elevacion = ndf.groupBy('Country', 'dt').avg("AverageTemperature", "AverageTemperatureUncertainty", "Elevation").toPandas()
pndf_elevacion.head()
Out[26]:
    Country                   dt  avg(AverageTemperature)  avg(AverageTemperatureUncertainty)  avg(Elevation)
0  Malaysia  1855-09-01 00:00:00                26.071875                            1.401469         73.0625
1  Malaysia  1856-04-01 00:00:00                26.221844                            0.824969         73.0625
2  Malaysia  1879-02-01 00:00:00                25.754094                            1.136281         73.0625
3  Malaysia  1886-03-01 00:00:00                26.510875                            0.829031         73.0625
4  Malaysia  1905-09-01 00:00:00                26.371094                            0.794250         73.0625
In [27]:
# correlate the numeric columns only (Country and dt are not numeric)
pndf_elevacion.select_dtypes("number").corr()
Out[27]:
                                    avg(AverageTemperature)  avg(AverageTemperatureUncertainty)  avg(Elevation)
avg(AverageTemperature)                            1.000000                           -0.275842        -0.15430
avg(AverageTemperatureUncertainty)                -0.275842                            1.000000         0.00412
avg(Elevation)                                    -0.154300                            0.004120         1.00000
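
By default `DataFrame.corr()` reports pairwise Pearson correlation coefficients. To make the numbers in the matrix easier to read, here is a minimal pure-Python sketch of that coefficient. The function name `pearson` and the elevation and temperature values are invented for illustration and are deliberately perfectly linear.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: temperature drops by the same amount for every 500 m climbed.
elevation = [0, 500, 1000, 1500, 2000]
temperature = [26.0, 23.0, 20.0, 17.0, 14.0]
r = pearson(elevation, temperature)
print(r)
```

A perfectly linear decrease gives a coefficient close to -1. The value of about -0.154 between `avg(AverageTemperature)` and `avg(Elevation)` in the matrix above is much closer to 0, which indicates only a weak negative linear relationship in these country-level aggregates.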
In [28]:
pndf_elevacion.describe()
Out[28]:
       avg(AverageTemperature)  avg(AverageTemperatureUncertainty)  avg(Elevation)
count            374572.000000                       374572.000000   374572.000000
mean                 17.110750                            1.104389      363.881974
std                   9.832992                            1.197959      481.472668
min                 -31.986000                            0.065000        0.000000
25%                  10.837413                            0.381556       35.888889
50%                  19.674000                            0.654287      228.357143
75%                  25.142000                            1.389425      428.000000
max                  37.812000                           15.213000     4007.000000